More Features Are Not Always Better: Evaluating Generalizing Models in Incident Type Classification of Tweets
نویسندگان
چکیده
Social media represents a rich source of upto-date information about events such as incidents. The sheer amount of available information makes machine learning approaches a necessity for further processing. This learning problem is often concerned with regionally restricted datasets such as data from only one city. Because social media data such as tweets varies considerably across different cities, the training of efficient models requires labeling data from each city of interest, which is costly and time consuming. In this study, we investigate which features are most suitable for training generalizable models, i.e., models that show good performance across different datasets. We reimplemented the most popular features from the state of the art in addition to other novel approaches, and evaluated them on data from ten different cities. We show that many sophisticated features are not necessarily valuable for training a generalized model and are outperformed by classic features such as plain word-n-grams and character-n-grams.
منابع مشابه
A Model for Detecting of Persian Rumors based on the Analysis of Contextual Features in the Content of Social Networks
The rumor is a collective attempt to interpret a vague but attractive situation by using the power of words. Therefore, identifying the rumor language can be helpful in identifying it. The previous research has focused more on the contextual information to reply tweets and less on the content features of the original rumor to address the rumor detection problem. Most of the studies have been in...
متن کاملMHSubLex: Using Metaheuristic Methods for Subjectivity Classification of Microblogs
In Web 2.0, people are free to share their experiences, views, and opinions. One of the problems that arises in web 2.0 is the sentiment analysis of texts produced by users in outlets such as Twitter. One of main the tasks of sentiment analysis is subjectivity classification. Our aim is to classify the subjectivity of Tweets. To this end, we create subjectivity lexicons in which the words into ...
متن کاملClassifier Ensemble Framework: a Diversity Based Approach
Pattern recognition systems are widely used in a host of different fields. Due to some reasons such as lack of knowledge about a method based on which the best classifier is detected for any arbitrary problem, and thanks to significant improvement in accuracy, researchers turn to ensemble methods in almost every task of pattern recognition. Classification as a major task in pattern recognition,...
متن کاملSemantic Abstraction for generalization of tweet classification: An evaluation of incident-related tweets
Social media is a rich source of up-to-date information about events such as incidents. The sheer amount of available information makes machine learning approaches a necessity to process this information further. This learning problem is often concerned with regionally restricted datasets such as data from only one city. Because social media data such as tweets varies considerably across differ...
متن کاملPresenting a three-objective model in location-allocation problems using combinational interval full-ranking and maximal covering with backup model
Covering models have found many applications in a wide variety of real-world problems; nevertheless, some assumptions of covering models are not realistic enough. Accordingly, a general approach would not be able to answer the needs of encountering varied aspects of real-world considerations. Assumptions like the unavailability of servers, uncertainty, and evaluating more factors at the same ti...
متن کامل